In the last post, I acquired the review data from Amazon, pre-processed it, converted it into a corpus, and built a word cloud. In this post I plan to tidy the data further and analyze it using visualizations.
Loading the libraries
Code
library(polite)
library(rvest)
library(ggplot2)
library(plotly)
library(tidyverse)
library(stringr)
library(quanteda)
library(tidyr)
library(RColorBrewer)
library(quanteda.textplots)
library(wordcloud)
library(wordcloud2)
library(devtools)
library(quanteda.dictionaries)
library(quanteda.sentiment)
knitr::opts_chunk$set(echo = TRUE)
Warning: package 'rvest' was built under R version 4.2.2
Warning: package 'ggplot2' was built under R version 4.2.2
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2': last_plot
The following object is masked from 'package:stats': filter
The following object is masked from 'package:graphics': layout
Attaching package: 'quanteda.sentiment'
The following object is masked from 'package:quanteda': data_dictionary_LSD2015
Reading the data
Code
reviews <- read_csv("amazonreview.csv")
New names:
• `` -> `...1`
Rows: 46450 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): review_title, review_text, review_star, ASIN
dbl (2): ...1, page
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
reviews
Summary of the data
Code
summary(reviews)
...1 review_title review_text review_star
Min. : 1 Length:46450 Length:46450 Length:46450
1st Qu.:11613 Class :character Class :character Class :character
Median :23226 Mode :character Mode :character Mode :character
Mean :23226
3rd Qu.:34838
Max. :46450
page ASIN
Min. : 1.0 Length:46450
1st Qu.: 97.0 Class :character
Median :194.0 Mode :character
Mean :195.2
3rd Qu.:291.0
Max. :400.0
Code
str(reviews)
spc_tbl_ [46,450 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ...1 : num [1:46450] 1 2 3 4 5 6 7 8 9 10 ...
$ review_title: chr [1:46450] "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n \r\n Refreshing and Edgy with Great Characters\r\n \r\n" "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n \r\n Well plotted and paced; excellent, fresh fantasy tale\r\n \r\n" "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n \r\n Compelling and Riveting!\r\n \r\n" "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n \r\n Let the Games Begin\r\n \r\n" ...
$ review_text : chr [1:46450] "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n \r\n I love fantasy; ever since I was a kid, stories set in creative w"| __truncated__ "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n \r\n First off, I'm a heavy duty fan of GRRM. I've read over a 100 dif"| __truncated__ "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n \r\n I am writing this review as an individual who watched Game of Thr"| __truncated__ "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n \r\n If you're going to consider reading `The Song of Ice and Fire' se"| __truncated__ ...
$ review_star : chr [1:46450] "5.0 out of 5 stars" "5.0 out of 5 stars" "5.0 out of 5 stars" "5.0 out of 5 stars" ...
$ page : num [1:46450] 1 1 1 1 1 1 1 1 1 1 ...
$ ASIN : chr [1:46450] "B0001DBI1Q" "B0001DBI1Q" "B0001DBI1Q" "B0001DBI1Q" ...
- attr(*, "spec")=
.. cols(
.. ...1 = col_double(),
.. review_title = col_character(),
.. review_text = col_character(),
.. review_star = col_character(),
.. page = col_double(),
.. ASIN = col_character()
.. )
- attr(*, "problems")=<externalptr>
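Pre-processing function
Code
clean_text <- function(text) {
  # Remove URLs
  str_remove_all(text, " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>%
    # Remove mentions
    str_remove_all("@[[:alnum:]_]*") %>%
    # Replace "&" character reference with "and"
    str_replace_all("&", "and") %>%
    # Remove punctuation
    str_remove_all("[[:punct:]]") %>%
    # Remove digits
    str_remove_all("[[:digit:]]") %>%
    # Replace any newline characters with a space
    str_replace_all("\\\n|\\\r", " ") %>%
    # Remove strings like "<U+0001F9F5>"
    str_remove_all("<.*?>") %>%
    # Make everything lowercase
    str_to_lower() %>%
    # Remove any trailing white space around the text and inside a string
    str_squish()
}
Tidying the data
Code
reviews$clean_text <- clean_text(reviews$review_text)
reviews <- reviews %>%
  drop_na(clean_text)
reviews
Removing unnecessary columns
Code
reviews <- reviews %>%
  select(-c(...1, page, review_text))
reviews
Pre-processing the title variable
Code
reviews$review_title <- reviews$review_title %>%
  str_remove_all("\n")
reviews
Converting the review stars from character to numeric
Code
reviews$review_star <- substr(reviews$review_star, 1, 3) %>%
  as.numeric()
reviews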
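To make the effect of the cleaning steps concrete, here is a minimal sketch on a made-up review string. The sample text and the helper name `clean_text_demo` are my own assumptions for illustration; the helper mirrors only a subset of the steps and is not the exact function used in this post. The last two lines show how the first three characters of a rating string such as "5.0 out of 5 stars" can be turned into a number.

```r
library(stringr)

# Hypothetical helper mirroring a subset of the cleaning steps (illustration only)
clean_text_demo <- function(text) {
  text |>
    str_replace_all("[\r\n]", " ") |>  # newlines -> spaces
    str_remove_all("[[:punct:]]") |>   # drop punctuation
    str_remove_all("[[:digit:]]") |>   # drop digits
    str_to_lower() |>                  # lowercase everything
    str_squish()                       # trim and collapse repeated whitespace
}

# Made-up string imitating the scraped layout (leading newlines and padding)
raw <- "\n\r\n   I love fantasy; I've read over 100 books!\r\n  "
cleaned <- clean_text_demo(raw)
print(cleaned)

# Rating strings: keep the first three characters and convert to numeric
stars <- as.numeric(substr(c("5.0 out of 5 stars", "3.0 out of 5 stars"), 1, 3))
print(stars)
```

Note that removing punctuation also strips apostrophes, so "I've" becomes "ive"; whether that matters depends on the later analysis.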
Frequency of stars
Code
p <- reviews %>%
  group_by(review_star) %>%
  ggplot(aes(review_star)) +
  geom_bar() +
  ggtitle("Frequency per star")
ggplotly(p)
Warning: The following aesthetics were dropped during statistical transformation:
x_plotlyDomain
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
Unique set of ASIN numbers
Code
reviews %>%
  select(ASIN) %>%
  unique()
Adding new variable book title to the reviews
Code
reviews <- reviews %>%
  mutate(book_title = case_when(
    ASIN == "B0001DBI1Q" ~ "A Game of Thrones: A Song of Ice and Fire, Book 1",
    ASIN == "B0001MC01Y" ~ "A Clash of Kings: A Song of Ice and Fire, Book 2",
    ASIN == "B00026WUZU" ~ "A Storm of Swords: A Song of Ice and Fire, Book 3",
    ASIN == "B07ZN4WM13" ~ "A Feast for Crows: A Song of Ice and Fire, Book 4",
    ASIN == "B005C7QVUE" ~ "A Dance with Dragons: A Song of Ice and Fire, Book 5",
    ASIN == "B000BO2D64" ~ "Twilight: The Twilight Saga, Book 1",
    ASIN == "B000I2JFQU" ~ "New Moon: The Twilight Saga, Book 2",
    ASIN == "B000UW50LW" ~ "Eclipse: The Twilight Saga, Book 3",
    ASIN == "B001FD6RLM" ~ "Breaking Dawn: The Twilight Saga, Book 4",
    ASIN == "B07HHJ7669" ~ "The Hunger Games",
    ASIN == "B07T6BQV2L" ~ "Catching Fire: The Hunger Games",
    ASIN == "B07T43YYRY" ~ "Mockingjay: The Hunger Games, Book 3"
  ))
reviews
Frequency of each type of star per book title
Code
p <- reviews %>%
  group_by(book_title, review_star) %>%
  count() %>%
  ggplot(aes(review_star, n, color = book_title)) +
  geom_line() +
  ggtitle("Book vs Freq of each star") +
  xlab("Type of star") +
  ylab("Frequency")
ggplotly(p)
The combined plot is hard to read, so let's plot each book in its own panel.
Code
p <- reviews %>%
  group_by(book_title, review_star) %>%
  count() %>%
  ggplot(aes(review_star, n)) +
  geom_line() +
  facet_wrap(vars(book_title), ncol = 2) +
  ggtitle("Book vs Freq of each star") +
  xlab("Type of star") +
  ylab("Frequency")
ggplotly(p)
Adding new variable series title to the reviews
Code
reviews <- reviews %>%
  mutate(series_title = case_when(
    ASIN == "B0001DBI1Q" ~ "A Song of Ice and Fire",
    ASIN == "B0001MC01Y" ~ "A Song of Ice and Fire",
    ASIN == "B00026WUZU" ~ "A Song of Ice and Fire",
    ASIN == "B07ZN4WM13" ~ "A Song of Ice and Fire",
    ASIN == "B005C7QVUE" ~ "A Song of Ice and Fire",
    ASIN == "B000BO2D64" ~ "The Twilight Saga",
    ASIN == "B000I2JFQU" ~ "The Twilight Saga",
    ASIN == "B000UW50LW" ~ "The Twilight Saga",
    ASIN == "B001FD6RLM" ~ "The Twilight Saga",
    ASIN == "B07HHJ7669" ~ "The Hunger Games",
    ASIN == "B07T6BQV2L" ~ "The Hunger Games",
    ASIN == "B07T43YYRY" ~ "The Hunger Games"
  ))
reviews
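As an aside, the same ASIN-to-title mapping could be kept in a small lookup table and joined onto the reviews, which avoids repeating each ASIN in two `case_when()` calls. This is only a sketch of that alternative, not what this post does: `book_lookup` and `toy_reviews` are hypothetical names, and the table lists just three of the twelve ASINs for brevity.

```r
library(dplyr)

# Hypothetical lookup table: one row per ASIN (only three ASINs shown here)
book_lookup <- tibble::tribble(
  ~ASIN,        ~series_title,
  "B0001DBI1Q", "A Song of Ice and Fire",
  "B000BO2D64", "The Twilight Saga",
  "B07HHJ7669", "The Hunger Games"
)

# Toy stand-in for the reviews data, including one ASIN missing from the lookup
toy_reviews <- tibble::tibble(
  ASIN        = c("B0001DBI1Q", "B07HHJ7669", "B0001DBI1Q", "UNKNOWN"),
  review_star = c(5, 4, 3, 1)
)

# left_join keeps every review; ASINs without a match get NA for series_title
labelled <- toy_reviews %>%
  left_join(book_lookup, by = "ASIN")
print(labelled)
```

A lookup table also makes unmatched ASINs easy to spot, since they show up as `NA` after the join.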
Frequency of each type of star per series title
Code
p <- reviews %>%
  group_by(series_title, review_star) %>%
  count() %>%
  ggplot(aes(review_star, n, color = series_title)) +
  geom_line() +
  ggtitle("Series vs Freq of each star") +
  xlab("Type of star") +
  ylab("Frequency")
ggplotly(p)
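If the series have different numbers of reviews, raw counts can be hard to compare across lines. One option, which is my own assumption and not something done in this post, is to convert the counts into within-series shares before plotting. The sketch below uses made-up data; `toy_reviews` and `star_share` are hypothetical names.

```r
library(dplyr)

# Made-up data: Series A has four reviews, Series B has two
toy_reviews <- tibble::tibble(
  series_title = c(rep("Series A", 4), rep("Series B", 2)),
  review_star  = c(5, 5, 4, 1, 5, 3)
)

star_share <- toy_reviews %>%
  count(series_title, review_star) %>%  # reviews per series and star rating
  group_by(series_title) %>%
  mutate(share = n / sum(n)) %>%        # fraction of that series' reviews
  ungroup()
print(star_share)
```

Plotting `share` instead of `n` would put all series on the same 0 to 1 scale.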
Tokenization of data
Code
# Converting the text into a corpus
text_corpus <- corpus(reviews$clean_text)
# Converting the text into tokens and removing English stopwords
text_token <- tokens(text_corpus, remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")
text_token
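Code
# Converting tokens into a document-feature matrix
text_dfm <- dfm(text_token)
text_dfm
Code
# Finding the frequency of each word
word_counts <- as.data.frame(sort(colSums(text_dfm), decreasing = TRUE))
colnames(word_counts) <- c("Frequency")
word_counts$word <- row.names(word_counts)
word_counts$Rank <- seq_len(ncol(text_dfm))
word_counts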
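For a quick look at the most frequent words without building the full ranking table, quanteda's `topfeatures()` returns a named vector sorted by frequency. Here is a minimal sketch on three made-up documents; `toy_text` and `toy_dfm` are hypothetical names and not objects from this post.

```r
library(quanteda)

# Three made-up "reviews"
toy_text <- c("love fantasy stories", "love fantasy", "love kid stories")
toy_dfm <- dfm(tokens(toy_text))

# Named numeric vector of the most frequent features, highest first
top3 <- topfeatures(toy_dfm, n = 3)
print(top3)
```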
Code
# Trimming the dfm
text_df <- dfm_trim(text_dfm, min_termfreq = 50, docfreq_type = "prop")
# Create fcm from dfm
text_fcm <- fcm(text_df)
text_fcm
Feature co-occurrence matrix of: 3,795 by 3,795 features.
features
features love fantasy ever since kid stories set creative worlds groups
love 13025 2234 3670 2395 337 1701 1155 224 161 72
fantasy 0 1779 884 584 45 554 456 76 99 20
ever 0 0 722 758 90 408 347 64 55 17
since 0 0 0 455 60 271 309 36 28 15
kid 0 0 0 0 48 23 27 6 5 5
stories 0 0 0 0 0 313 190 28 33 33
set 0 0 0 0 0 0 237 26 29 11
creative 0 0 0 0 0 0 0 22 3 2
worlds 0 0 0 0 0 0 0 0 8 4
groups 0 0 0 0 0 0 0 0 0 2
[ reached max_feat ... 3,785 more features, reached max_nfeat ... 3,785 more features ]
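To help read the matrix above, here is a small sketch of what a feature co-occurrence matrix counts, using two made-up documents. The names `toy_text`, `toy_dfm`, `toy_fcm`, and `co` are hypothetical. With the default document-level context, a cell counts how often two features appear in the same document, and only one triangle of the matrix is filled in, which is why the lower-left of the output above is all zeros.

```r
library(quanteda)

# Two made-up documents; "love" and "fantasy" appear together in both
toy_text <- c("love fantasy", "love stories fantasy")
toy_dfm <- dfm(tokens(toy_text))

# Default settings: co-occurrence counted within each document
toy_fcm <- fcm(toy_dfm)

# Convert to a plain matrix for easier inspection
co <- as.matrix(toy_fcm)
print(co)
```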
Network plot
Code
# Pull the top features
top_features <- names(topfeatures(text_fcm, 50))
# Retain only those top features as part of our matrix
even_text_fcm <- fcm_select(text_fcm, pattern = top_features, selection = "keep")
# Compute size weight for vertices in network
size <- log(colSums(even_text_fcm))
# Create plot
textplot_network(even_text_fcm, vertex_size = size / max(size) * 2)
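The vertex sizing in the network plot takes the log of each feature's column total and rescales it so the largest vertex has size 2. A tiny base R sketch with made-up totals (the numbers and the name `totals` are assumptions for illustration) shows how the log compresses large differences:

```r
# Made-up co-occurrence totals for three features
totals <- c(love = 1000, fantasy = 100, kid = 10)

# Log transform, then rescale so the largest value maps to 2
size <- log(totals)
vertex_size <- size / max(size) * 2
print(round(vertex_size, 2))
```

Although "love" has 100 times the total of "kid" here, its vertex is only three times as large, which keeps very frequent words from dominating the plot.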
Further study
Next, I will try sentiment analysis using multiple lexicons and compare which one is most suitable.